Welcome to Boston, Massachusetts in the 1970s! Imagine you're working for a real estate development company. Your company wants to value any residential project before it starts. You are tasked with building a model that can provide a price estimate based on a home's characteristics, such as the number of rooms, the distance to employment centres, and the neighbourhood's pupil-teacher ratio.
To accomplish your task you will analyse and explore the Boston house price data, split the data for training and testing, run a multivariable regression, evaluate your model's coefficients and residuals, and use data transformations to improve your model's performance.
Google Colab may not be running the latest version of plotly. If you're working in Google Colab, uncomment the line below, run the cell, and restart your notebook server.
#%pip install --upgrade plotly
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
pd.options.display.float_format = '{:,.2f}'.format
The first column in the .csv file just has the row numbers, so it will be used as the index.
data = pd.read_csv('boston.csv', index_col=0)
Characteristics:
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive. The Median Value (attribute 14) is the target.
:Attribute Information (in order):
1. CRIM per capita crime rate by town
2. ZN proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS proportion of non-retail business acres per town
4. CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX nitric oxides concentration (parts per 10 million)
6. RM average number of rooms per dwelling
7. AGE proportion of owner-occupied units built prior to 1940
8. DIS weighted distances to five Boston employment centres
9. RAD index of accessibility to radial highways
10. TAX full-value property-tax rate per $10,000
11. PTRATIO pupil-teacher ratio by town
12. B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13. LSTAT % lower status of the population
14. PRICE Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset. This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. You can find the original research paper here.
Challenge
How many rows and columns does the dataset have? What are the column names? Are there any missing values or duplicates?

data.shape
(506, 14)
data.count()
CRIM       506
ZN         506
INDUS      506
CHAS       506
NOX        506
RM         506
AGE        506
DIS        506
RAD        506
TAX        506
PTRATIO    506
B          506
LSTAT      506
PRICE      506
dtype: int64
data.columns
Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
'PTRATIO', 'B', 'LSTAT', 'PRICE'],
dtype='object')
data.head()
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.01 | 18.00 | 2.31 | 0.00 | 0.54 | 6.58 | 65.20 | 4.09 | 1.00 | 296.00 | 15.30 | 396.90 | 4.98 | 24.00 |
| 1 | 0.03 | 0.00 | 7.07 | 0.00 | 0.47 | 6.42 | 78.90 | 4.97 | 2.00 | 242.00 | 17.80 | 396.90 | 9.14 | 21.60 |
| 2 | 0.03 | 0.00 | 7.07 | 0.00 | 0.47 | 7.18 | 61.10 | 4.97 | 2.00 | 242.00 | 17.80 | 392.83 | 4.03 | 34.70 |
| 3 | 0.03 | 0.00 | 2.18 | 0.00 | 0.46 | 7.00 | 45.80 | 6.06 | 3.00 | 222.00 | 18.70 | 394.63 | 2.94 | 33.40 |
| 4 | 0.07 | 0.00 | 2.18 | 0.00 | 0.46 | 7.15 | 54.20 | 6.06 | 3.00 | 222.00 | 18.70 | 396.90 | 5.33 | 36.20 |
data.duplicated().values.any() # no duplicates
False
data.isna().values.any() # no null values
False
data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    float64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    float64
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  PRICE    506 non-null    float64
dtypes: float64(14)
memory usage: 59.3 KB
Challenge
How many students are there per teacher on average? What do the CHAS, RM, and PTRATIO features represent?

data.PTRATIO.mean()
18.455533596837945
PTRATIO is the number of students per teacher (the pupil-teacher ratio by town).
CHAS is the Charles River dummy variable (= 1 if the tract bounds the river; 0 otherwise).
RM is the average number of rooms per dwelling.
data.describe()
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 |
| mean | 3.61 | 11.36 | 11.14 | 0.07 | 0.55 | 6.28 | 68.57 | 3.80 | 9.55 | 408.24 | 18.46 | 356.67 | 12.65 | 22.53 |
| std | 8.60 | 23.32 | 6.86 | 0.25 | 0.12 | 0.70 | 28.15 | 2.11 | 8.71 | 168.54 | 2.16 | 91.29 | 7.14 | 9.20 |
| min | 0.01 | 0.00 | 0.46 | 0.00 | 0.39 | 3.56 | 2.90 | 1.13 | 1.00 | 187.00 | 12.60 | 0.32 | 1.73 | 5.00 |
| 25% | 0.08 | 0.00 | 5.19 | 0.00 | 0.45 | 5.89 | 45.02 | 2.10 | 4.00 | 279.00 | 17.40 | 375.38 | 6.95 | 17.02 |
| 50% | 0.26 | 0.00 | 9.69 | 0.00 | 0.54 | 6.21 | 77.50 | 3.21 | 5.00 | 330.00 | 19.05 | 391.44 | 11.36 | 21.20 |
| 75% | 3.68 | 12.50 | 18.10 | 0.00 | 0.62 | 6.62 | 94.07 | 5.19 | 24.00 | 666.00 | 20.20 | 396.23 | 16.96 | 25.00 |
| max | 88.98 | 100.00 | 27.74 | 1.00 | 0.87 | 8.78 | 100.00 | 12.13 | 24.00 | 711.00 | 22.00 | 396.90 | 37.97 | 50.00 |
Challenge: Having looked at some descriptive statistics, visualise the data for your model. Use Seaborn's .displot() to create a histogram and superimpose the Kernel Density Estimate (KDE) for the following variables: PRICE, DIS, RM, and RAD.
Try setting the aspect parameter to 2 for a better picture.
What do you notice in the distributions of the data?
data.head()
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.01 | 18.00 | 2.31 | 0.00 | 0.54 | 6.58 | 65.20 | 4.09 | 1.00 | 296.00 | 15.30 | 396.90 | 4.98 | 24.00 |
| 1 | 0.03 | 0.00 | 7.07 | 0.00 | 0.47 | 6.42 | 78.90 | 4.97 | 2.00 | 242.00 | 17.80 | 396.90 | 9.14 | 21.60 |
| 2 | 0.03 | 0.00 | 7.07 | 0.00 | 0.47 | 7.18 | 61.10 | 4.97 | 2.00 | 242.00 | 17.80 | 392.83 | 4.03 | 34.70 |
| 3 | 0.03 | 0.00 | 2.18 | 0.00 | 0.46 | 7.00 | 45.80 | 6.06 | 3.00 | 222.00 | 18.70 | 394.63 | 2.94 | 33.40 |
| 4 | 0.07 | 0.00 | 2.18 | 0.00 | 0.46 | 7.15 | 54.20 | 6.06 | 3.00 | 222.00 | 18.70 | 396.90 | 5.33 | 36.20 |
sns.displot(data=data.PRICE,
aspect=2,
bins=50,
kde=True,
color='green'
)
plt.xlabel("Price in $1000's")
plt.ylabel("Number of Homes")
plt.title("1970's Home Values in Boston")
plt.show()
sns.displot(data=data.DIS,
aspect=2,
bins=50,
kde=True,
color='red'
)
plt.xlabel("Distance to Employment Centres")
plt.ylabel("Number of Homes")
plt.title("Distance to Employment Centres in 1970's Boston")
plt.show()
sns.displot(data=data.RM,
aspect=2,
bins=20,
kde=True,
color='purple'
)
plt.xlabel("Average Number of Rooms")
plt.ylabel("Number of Homes")
plt.title("Average Number of Rooms in 1970's Boston")
plt.show()
sns.displot(data.RAD,
bins=50,
ec='black',
color='#7b1fa2')
plt.xlabel('Accessibility to Highways')
plt.ylabel('Number of Houses')
plt.title("Access to Radial Highways in 1970's Boston")
plt.show()
Challenge
Create a bar chart with plotly for CHAS to show how many more homes are away from the river versus next to it. The bar chart should look something like this:
You can make your life easier by providing a list of values for the x-axis (e.g., x=['No', 'Yes'])
chas = data.CHAS.value_counts()
bar_chart = px.bar(x=['No', 'Yes'],
y=chas.values,
title='Next to Charles River?',
color=chas.values,
)
bar_chart.update_layout(xaxis_title='Property Located Next to the River?',
yaxis_title='Number of Homes',
coloraxis_showscale=False)
bar_chart.show()
Challenge
There might be some relationships in the data that we should know about. Before you run the code, make some predictions:
Run a Seaborn .pairplot() to visualise all the relationships at the same time. Note, this is a big task and can take 1-2 minutes! After it's finished check your intuition regarding the questions above on the pairplot.
sns.pairplot(data)
plt.show()
Challenge
Use Seaborn's .jointplot() to look at some of the relationships in more detail. Create a jointplot for:
Try adding some opacity or alpha to the scatter plots using keyword arguments under joint_kws.
Challenge:
Compare DIS (Distance from employment) with NOX (Nitric Oxide Pollution) using Seaborn's .jointplot(). Does pollution go up or down as the distance increases?
with sns.axes_style('darkgrid'):
sns.jointplot(x=data['DIS'],
y=data['NOX'],
kind='hex',
joint_kws={'alpha':0.5})
Challenge:
Compare INDUS (the proportion of non-retail industry i.e., factories) with NOX (Nitric Oxide Pollution) using Seaborn's .jointplot(). Does pollution go up or down as there is a higher proportion of industry?
with sns.axes_style('darkgrid'):
sns.jointplot(x=data['NOX'],
y=data['INDUS'],
height=6,
color='darkblue',
kind='hex',
joint_kws={'alpha':0.5})
plt.show()
Challenge
Compare LSTAT (proportion of lower-income population) with RM (number of rooms) using Seaborn's .jointplot(). How does the number of rooms per dwelling vary with the poverty of area? Do homes have more or fewer rooms when LSTAT is low?
with sns.axes_style('darkgrid'):
sns.jointplot(x=data.LSTAT,
y=data.RM,
kind='hex',
height=7,
color='purple',
joint_kws={'alpha':0.5})
plt.show()
Challenge
Compare LSTAT with PRICE using Seaborn's .jointplot(). How does the proportion of the lower-income population in an area affect home prices?
with sns.axes_style('darkgrid'):
sns.jointplot(x=data.LSTAT,
y=data.PRICE,
kind='hex',
height=7,
color='crimson',
joint_kws={'alpha':0.5})
plt.show()
Challenge
Compare RM (number of rooms) with PRICE using Seaborn's .jointplot(). You can probably guess how the number of rooms affects home prices. 😊
with sns.axes_style('whitegrid'):
sns.jointplot(x=data.RM,
y=data.PRICE,
height=7,
kind='hex',
color='darkblue',
joint_kws={'alpha':0.5})
plt.show()
We can't use all 506 entries in our dataset to train our model. The reason is that we want to evaluate our model on data that it hasn't seen yet (i.e., out-of-sample data). That way we can get a better idea of its performance in the real world.
Challenge
Use the train_test_split() function from sklearn to split the data into a training and testing dataset. Set random_state=10. This helps us get the same results every time and avoid confusion while we're learning.

Hint: Remember, your target is your home PRICE, and your features are all the other columns you'll use to predict the price.
target = data['PRICE']
features = data.drop('PRICE', axis=1)
X_train, X_test, y_train, y_test = train_test_split(features,
target,
test_size=0.2,
random_state=10)
# % of training set
train_pct = 100*len(X_train)/len(features)
print(f'Training data is {train_pct:.3}% of the total data.')
# % of test data set
test_pct = 100*X_test.shape[0]/features.shape[0]
print(f'Test data makes up the remaining {test_pct:0.3}%.')
Training data is 79.8% of the total data. Test data makes up the remaining 20.2%.
In a previous lesson, we had a linear model with only a single feature (our movie budgets). This time we have a total of 13 features. Therefore, our Linear Regression model will have the following form:
$$ \widehat{PRICE} = \theta_0 + \theta_1 RM + \theta_2 NOX + \theta_3 DIS + \theta_4 CHAS + \dots + \theta_{13} LSTAT $$

Challenge
Use sklearn to run the regression on the training dataset. How high is the r-squared for the regression on the training data?
regression = LinearRegression()
regression.fit(X_train, y_train)
rsquared = regression.score(X_train, y_train)
rsquared
0.750121534530608
Here we do a sense check on our regression coefficients. The first thing to look for is whether the coefficients have the expected sign (positive or negative).
Challenge Print out the coefficients (the thetas in the equation above) for the features. Hint: You'll see a nice table if you stick the coefficients in a DataFrame.
coef = pd.DataFrame(data=regression.coef_, index=X_train.columns, columns=['Coefficient'])
coef
| | Coefficient |
|---|---|
| CRIM | -0.13 |
| ZN | 0.06 |
| INDUS | -0.01 |
| CHAS | 1.97 |
| NOX | -16.27 |
| RM | 3.11 |
| AGE | 0.02 |
| DIS | -1.48 |
| RAD | 0.30 |
| TAX | -0.01 |
| PTRATIO | -0.82 |
| B | 0.01 |
| LSTAT | -0.58 |
coef.loc['RM'].values[0] * 1000 # the price premium is in $
3108.4562454033
The next step is to evaluate our regression. How good our regression is depends not only on the r-squared. It also depends on the residuals - the difference between the model's predictions ($\hat y_i$) and the true values ($y_i$) inside y_train.
predicted_values = regression.predict(X_train)
residuals = (y_train - predicted_values)
Challenge: Create two scatter plots.
The first plot should be actual values (y_train) against the predicted values:
The cyan line in the middle shows y_train against y_train. If the predictions had been 100% accurate then all the dots would be on this line. The further away the dots are from the line, the worse the prediction was. That makes the distance to the cyan line, you guessed it, our residuals 😊
The second plot should be the residuals against the predicted prices. Here's what we're looking for:
predicted_vals = regression.predict(X_train)
residuals = (y_train - predicted_vals)
# Original Regression of Actual vs. Predicted Prices
plt.figure(dpi=100)
plt.scatter(x=y_train, y=predicted_vals, c='lightblue', alpha=0.6)
plt.plot(y_train, y_train, color='cyan')
plt.title(f'Actual vs Predicted Prices: $y _i$ vs $\hat y_i$', fontsize=17)
plt.xlabel('Actual prices 000s $y _i$', fontsize=14)
plt.ylabel('Predicted prices 000s $\hat y _i$', fontsize=14)
plt.show()
# Residuals vs Predicted values
plt.figure(dpi=100)
plt.scatter(x=predicted_vals, y=residuals, c='lightblue', alpha=0.6)
plt.title('Residuals vs Predicted Values', fontsize=17)
plt.xlabel('Predicted Prices $\hat y _i$', fontsize=14)
plt.ylabel('Residuals', fontsize=14)
plt.show()
Why do we want to look at the residuals? We want to check that they look random. Why? The residuals represent the errors of our model. If there's a pattern in our errors, then our model has a systematic bias.
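As a quick illustration of what a systematic bias looks like, here's a hypothetical sketch (not part of the Boston data): if we fit a straight line to data that is actually quadratic, the residuals form a clear U-shape instead of random noise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data (not from the Boston dataset): a quadratic relationship
x = np.linspace(0, 10, 50).reshape(-1, 1)
y = x.ravel() ** 2

# Fit a straight line to the curved data
line = LinearRegression().fit(x, y)
residuals = y - line.predict(x)

# The line under-predicts at both ends and over-predicts in the middle,
# so the residuals follow a U-shaped pattern rather than looking random.
print(residuals[0] > 0, residuals[25] < 0, residuals[-1] > 0)
```

If a residual plot for our housing model showed a shape like this, it would tell us the model is missing some structure in the data.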
We can analyse the distribution of the residuals. In particular, we're interested in the skew and the mean.
In an ideal case, what we want is something close to a normal distribution. A normal distribution has a skewness of 0 and a mean of 0. A skew of 0 means that the distribution is symmetrical - the bell curve is not lopsided or biased to one side. Here's what a normal distribution looks like:
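As a quick sanity check of those reference values (illustrative only, not part of the challenge), a large sample drawn from a standard normal distribution should have a mean and skew both very close to 0:

```python
import numpy as np
import pandas as pd

# Draw a large sample from a standard normal distribution
rng = np.random.default_rng(42)
sample = pd.Series(rng.normal(loc=0, scale=1, size=100_000))

# Both statistics should be very close to 0 for a normal distribution
print(f'Mean: {sample.mean():.3f}, Skew: {sample.skew():.3f}')
```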
Challenge
Use .displot() to create a histogram and superimpose the Kernel Density Estimate (KDE) of the residuals. Calculate the mean and the skew of the residuals as well.

res_mean = round(residuals.mean(), 2)
res_skew = round(residuals.skew(), 2)
sns.displot(residuals, kde=True, color='blue')
plt.title(f'Residuals Skew ({res_skew}) Mean ({res_mean})')
plt.show()
We have two options at this point:
1. Change our model entirely.
2. Transform our data so that it better fits the assumptions of a linear model.
Let's try a data transformation approach.
Challenge
Investigate if the target data['PRICE'] could be a suitable candidate for a log transformation.
1. Use Seaborn's .displot() to show a histogram and KDE of the price data.
2. Use NumPy's log() function to create a Series that has the log prices.
3. Plot the log prices using .displot() and calculate the skew.

price_skew = data['PRICE'].skew()
sns.displot(data['PRICE'], kde=True, color='green')
plt.title(f'Normal Prices. Skew is {price_skew:.3}')
plt.show()
y_log = np.log(data['PRICE'])
sns.displot(y_log, kde=True, color='green')
plt.title(f'Log Prices. Skew is {y_log.skew():.3}')
plt.show()
Using a log transformation does not affect every price equally. Large prices are affected more than smaller prices in the dataset. Here's how the prices are "compressed" by the log transformation:
We can see this when we plot the actual prices against the (transformed) log prices.
plt.figure(dpi=150)
plt.scatter(data.PRICE, np.log(data.PRICE))
plt.title('Mapping the Original Price to a Log Price')
plt.ylabel('Log Price')
plt.xlabel('Actual $ Price in 000s')
plt.show()
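The compression is easy to verify numerically (illustrative prices, not rows from the dataset): every doubling of the price adds the same constant amount, log(2) ≈ 0.69, in log space, no matter how large the price is, so big absolute gaps between expensive homes get squeezed together.

```python
import numpy as np

# Each doubling of the price adds the same amount in log space,
# so the gap from 40 to 80 shrinks to the same size as the gap
# from 10 to 20 after the transformation.
for price in [10, 20, 40, 80]:
    print(price, round(np.log(price), 2))

# The constant step between doublings is log(2)
print(round(np.log(2), 2))
```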
Using log prices instead, our model has changed to:
$$ \log (\widehat{PRICE}) = \theta_0 + \theta_1 RM + \theta_2 NOX + \theta_3 DIS + \theta_4 CHAS + \dots + \theta_{13} LSTAT $$

Challenge:
Use train_test_split() with the same random state as before to make the results comparable.

new_target = np.log(data['PRICE'])
features = data.drop('PRICE', axis=1)
X_train, X_test, log_y_train, log_y_test = train_test_split(features,
new_target,
test_size=0.2,
random_state=10)
log_regr = LinearRegression()
log_regr.fit(X_train, log_y_train)
log_rsquared = log_regr.score(X_train, log_y_train)
log_predictions = log_regr.predict(X_train)
log_residuals = (log_y_train - log_predictions)
print(f'Training data r-squared: {log_rsquared:.2}')
Training data r-squared: 0.79
# 0.79 compared to 0.75 for the original model
Challenge: Print out the coefficients of the new regression model.
Hint: Use a DataFrame to make the output look pretty.
df_coef = pd.DataFrame(data=log_regr.coef_, index=X_train.columns, columns=['coef'])
df_coef
| | coef |
|---|---|
| CRIM | -0.01 |
| ZN | 0.00 |
| INDUS | 0.00 |
| CHAS | 0.08 |
| NOX | -0.70 |
| RM | 0.07 |
| AGE | 0.00 |
| DIS | -0.05 |
| RAD | 0.01 |
| TAX | -0.00 |
| PTRATIO | -0.03 |
| B | 0.00 |
| LSTAT | -0.03 |
# compare with the coefficients of the original model
coef
| | Coefficient |
|---|---|
| CRIM | -0.13 |
| ZN | 0.06 |
| INDUS | -0.01 |
| CHAS | 1.97 |
| NOX | -16.27 |
| RM | 3.11 |
| AGE | 0.02 |
| DIS | -1.48 |
| RAD | 0.30 |
| TAX | -0.01 |
| PTRATIO | -0.82 |
| B | 0.01 |
| LSTAT | -0.58 |
Challenge:
Use indigo as the colour for the original regression and navy for the regression using log prices.

# Graph of Actual vs. Predicted Log Prices
plt.scatter(x=log_y_train, y=log_predictions, c='navy', alpha=0.6)
plt.plot(log_y_train, log_y_train, color='cyan')
plt.title(f'Actual vs Predicted Log Prices: $y _i$ vs $\hat y_i$ (R-Squared {log_rsquared:.2})', fontsize=17)
plt.xlabel('Actual Log Prices $y _i$', fontsize=14)
plt.ylabel('Predicted Log Prices $\hat y _i$', fontsize=14)
plt.show()
# Original Regression of Actual vs. Predicted Prices
plt.scatter(x=y_train, y=predicted_vals, c='indigo', alpha=0.6)
plt.plot(y_train, y_train, color='cyan')
plt.title(f'Original Actual vs Predicted Prices: $y _i$ vs $\hat y_i$ (R-Squared {rsquared:.3})', fontsize=17)
plt.xlabel('Actual prices 000s $y _i$', fontsize=14)
plt.ylabel('Predicted prices 000s $\hat y _i$', fontsize=14)
plt.show()
# Residuals vs Predicted values (Log prices)
plt.scatter(x=log_predictions, y=log_residuals, c='navy', alpha=0.6)
plt.title('Residuals vs Fitted Values for Log Prices', fontsize=17)
plt.xlabel('Predicted Log Prices $\hat y _i$', fontsize=14)
plt.ylabel('Residuals', fontsize=14)
plt.show()
# Residuals vs Predicted values
plt.scatter(x=predicted_vals, y=residuals, c='indigo', alpha=0.6)
plt.title('Original Residuals vs Fitted Values', fontsize=17)
plt.xlabel('Predicted Prices $\hat y _i$', fontsize=14)
plt.ylabel('Residuals', fontsize=14)
plt.show()
# copied from solution :)
Challenge:
Calculate the mean and the skew for the residuals using log prices. Are the mean and skew closer to 0 for the regression using log prices?
log_resid_mean = round(log_residuals.mean(), 2)
log_resid_skew = round(log_residuals.skew(), 2)
sns.displot(log_residuals,
kde=True,
color='green')
plt.title(f'Log price model: Residuals Skew ({log_resid_skew}) Mean ({log_resid_mean})')
plt.show()
sns.displot(residuals,
kde=True,
color='red')
plt.title(f'Original model: Residuals Skew ({res_skew}) Mean ({res_mean})')
plt.show()
The real test is how our model performs on data that it has not "seen" yet. This is where our X_test comes in.
Challenge
Compare the r-squared of the two models on the test dataset. Which model does better? Is the r-squared higher or lower than for the training dataset? Why?
print(f'Original Model r-squared: {regression.score(X_test, y_test):.2}')
print(f'Log Model r-squared: {log_regr.score(X_test, log_y_test):.2}')
Original Model r-squared: 0.67 Log Model r-squared: 0.74
Our preferred model now has an equation that looks like this:
$$ \log (\widehat{PRICE}) = \theta_0 + \theta_1 RM + \theta_2 NOX + \theta_3 DIS + \theta_4 CHAS + \dots + \theta_{13} LSTAT $$

The average property has the mean value for all its characteristics:
# Starting Point: Average Values in the Dataset
features = data.drop(['PRICE'], axis=1)
average_vals = features.mean().values
property_stats = pd.DataFrame(data=average_vals.reshape(1, len(features.columns)),
columns=features.columns)
property_stats
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.61 | 11.36 | 11.14 | 0.07 | 0.55 | 6.28 | 68.57 | 3.80 | 9.55 | 408.24 | 18.46 | 356.67 | 12.65 |
Challenge
Predict how much the average property is worth using the stats above. What is the log price estimate and what is the dollar estimate? You'll have to reverse the log transformation with np.exp() to find the dollar value.
estimate_of_log = log_regr.predict(property_stats)[0]
print(f'The log price estimate is ${estimate_of_log:.3}')
The log price estimate is $3.03
dollar_est = np.exp(estimate_of_log) * 1000
print(f'The property is estimated to be worth ${dollar_est:.6}')
The property is estimated to be worth $20703.2
Challenge
Keeping the average values for CRIM, RAD, INDUS and the other features, value a property with the following characteristics:
1. Next to the Charles River
2. 8 rooms
3. 20 students per classroom
4. A distance of 5 to the employment centres
5. High pollution (NOX at the 75th percentile)
6. Low poverty (LSTAT at the 25th percentile)
# Define Property Characteristics
next_to_river = True
nr_rooms = 8
students_per_classroom = 20
distance_to_town = 5
pollution = data.NOX.quantile(q=0.75) # high
amount_of_poverty = data.LSTAT.quantile(q=0.25) # low
# Solution:
property_stats['RM'] = nr_rooms
property_stats['PTRATIO'] = students_per_classroom
property_stats['DIS'] = distance_to_town
if next_to_river:
property_stats['CHAS'] = 1
else:
property_stats['CHAS'] = 0
property_stats['NOX'] = pollution
property_stats['LSTAT'] = amount_of_poverty
log_estimate = log_regr.predict(property_stats)[0]
print(f'The log price estimate is ${log_estimate:.3}')
dollar_est = np.e**log_estimate * 1000
print(f'The property is estimated to be worth ${dollar_est:.6}')
The log price estimate is $3.25 The property is estimated to be worth $25792.0